data <- data %>% mutate( Local_DateTime =
dmy_hm(Local Time), Month = month(Local_DateTime),
WeekdayNum = wday(Local_DateTime), Year = year(Local_DateTime), Date =
as.Date(Local_DateTime), Hour = hour(Local_DateTime), # Extract hour
winter_year = ifelse(Month >= 11, Year, Year - 1), winter_start =
as.Date(paste0(winter_year, “-11-01”)), DSN = as.numeric(Date -
winter_start) )
Add this at the end once we have our results
The main aim of this report was to enable NESO to estimate demand patterns in electricity over the long-term. To do this we considered several ordinary least squares (OLS) regression model each fitted to assess the association between a set of predictor variables and average daily demand. Our predictor variables included both temporal and weather based variables. We then chose a model by comparing important scores such as ____ (which ______). We further adapted it to ensure practicality and simplicity of the model. Finally we analyse the predictive ability of our model by applying cross validation to ensure the model has strong predictive ability and to ensure its not over fitting to the data set.
Prior to fitting any models we examine the dataset used in our analysis. We consider the following predictor variables for our linear model:
We next examine the data visually to highlight any general trends and detect any anomalies. We use scatter plots below to examine how each predictor correlates with demand.
<<<<<<< HEAD## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Key observations from the analysis indicate that wind has no clear correlation with average demand and was therefore removed from the model as it provides no predictive value. Most the solar data are concentrated between 0 and 0.05, with higher solar generation associated with lower average demand, suggesting an inverse relationship. This aligns with expectations, as increased sunlight reduces the need for lighting, lowering electricity demand. Temperature, TO and TE all show a negative linear correlation with average demand, suggesting that demand tends to decrease as temperature rises. This is intuitive since as warmer conditions generally reduce heating needs. TE shows the strongest negative effect and is considered the best temperature measure because it incorporates information from both today and the previous day. Its importance lies in averaging with the past day, capturing a lagged effect that may reflect customers updating heating settings with a delay. The days since november variable follows an overall quadratic distribution, but has a pronounced dip in December which suggests that an interaction between month and DSN should be considered in the model. Month has a clear effect on demand, with the highest levels observes in January and February and the lowest in November and March. Finally, day of the week is an important predictor to include in the model as demand is lower on weekends, highest on weekdays, with Monday through Thursday showing similar mean demand and a slight decline on Fridays. Combining weekdays and weekends reduce the resolution of these important differences. Year also has an impact on demand and will be included as a factor variable. However, this will be stripped out of the model by NESO, who will add their own year effect for future years.
We aim to fit a ordinary least-squares linear regression model of the form \[\mathbf{y} = X \boldsymbol{\beta} + \boldsymbol{\epsilon}.\] In this type of model we assume the following
To check whether a variable is significant in the model or not, we use the following hypothesis test about the model parameters. \[\begin{eqnarray*} H_0&:& \beta_i = 0 \\ H_1&:& \beta_i \neq 0. \end{eqnarray*}\] using the test statistic \(T = \frac{\hat{\beta}_i - \beta_0}{\hat{\sigma}_{\hat{\beta}_i}} \sim t_{n-p}\). The resulting p-value of the test gives the probability of observing a value as or more extreme than the observed test statistic given that \(H_0\) is true. The model summary output gives the p-value of these tests for each of the model parameters. A high p-value suggests that we cannot reject the null hypothesis that \(H_0\) is true, that is that the associated parameter in the model is equal to 0, and the variable has no statistically significant effect in the model.
The \(R^2\) metric measures the proportion of variance explained by the linear model. Formally it is defined as \[\begin{equation*} r^2 = 1 - \frac{\sum_{i}^{n}(y_i - \hat{\mu}_i)^2}{\sum_{i}^{n}(y_i - \overline{y})^2} \end{equation*}\] where \(\mu_i\) is the estimated value of \(y_i\) by the model and \(\bar{y}\) is the sample mean of the data. The \(R^2\) tends to overestimate how well a model is doing, and so to account for this we look at the adjusted \(R^2\) metric which accounts for the number of parameters in the model. \[\begin{equation*} r^2_{\text{adj}} = 1 - \frac{\frac{\sum_{i}^{n}(y_i - \hat{\mu}_i)^2}{(n-p)}} {\frac{\sum_{i}^{n}(y_i - \overline{y})^2}{(n-1)}} \end{equation*}\] Models with a higher adjusted \(R^2\) are preferable over models with a lower one, as this indicates that the model explains a higher proportion of the variability within the data.
The AIC is a metric used to compare the quality of a model for a given dataset. It balances model fit with model complexity by penalising the number of parameters included in the model. Formally, AIC is defined as \[\begin{equation*} AIC=2p-2\ln(L) \end{equation*}\] where \(p\) is the number of parameters in the model and \(L\) is the maximised value of the likelihood function.
Lower AIC indicate a better balance between fit and model complexity. When comparing models, the model with the lowest AIC is generally preferred, as it explains the data well without including unnecessary parameters.
Based on the exploratory data analysis, the final model for predicting average demand was selected to capture the most important predictors and accounting for non-linearities and interactions <<<<<<< HEAD identified in the data while balancing simplicity. The final mode is:
….. (unsure how to write final model)
Table here
======= identified in the data while balancing simplicity. The final model is:not sure if this is the best way to format it but thought I would add it in for now…
\[\begin{eqnarray*} M_F: y_i = \beta_0 + \beta_1 \text{solar}_i + \beta_2 \text{wdaynumber}_i + \beta_3 \text{month}_i + \beta_4 \text{year}_i + \beta_5 \text{DSN}_i + \beta_6 \text{TE}_i + \beta_7 \, \text{DSN}^2\text{:month} + \epsilon_i \end{eqnarray*}\] where \(\epsilon_i \overset{\mathrm{iid}}\sim \mathcal{N}(0, \sigma^2)\).
>>>>>>> origin/mainThe TE variable was selected as the temperature
predictor variable because it incorporates information from both the
current and previous day, capturing potential lagged effects in demand.
This choice was supported quantitatively. When comparing modes that
differed only in the temperature variable, TE produced a
higher \(R^2\), lower RMSE and lower
AIC indicating a better overall model fit and predictive
performance.
| Model | \(R^2\) | Adjusted \(R^2\) | RMSE | AIC |
|---|---|---|---|---|
| Model with temp | 0.8585733 | 0.8569565 | 1533.105 | 62062.24 |
| Model with TO | 0.8478952 | 0.8461564 | 1589.928 | 62319.91 |
| Model with TE | 0.8696681 | 0.8681782 | 1471.741 | 61773.03 |
Solar generation was included because higher solar output was
associated with lower average demand and was statistically significant
in the final model, \(p<0.01\). Day
of the week effects were modelled as a factor to account lower weekend
demand and small variations across weekdays with all factor levels
highly significant, \(p<0.01\). The
variable DSN (days since November) exhibited a quadratic
relation with demand in the earlier analysis so \(DSN^2\) was included to capture this
non-linear trend. Month and Year were included
as factor variables to account for seasonal and annual variation in
demand. A pronounced dip in December motivated the inclusion of an
interaction between DSN^2 and Month, allowing
the model to account for monthly variations in the quadratic effect and
substantially improving the AIC (decrease of 1,846.95).
While the inclusion of all two-way interactions followed by backwards selection yielded a slightly higher adjusted \(R^2\), this model was chosen to favour simplicity, as the additional interaction terms contributed only marginal effects.
<<<<<<< HEAD =======| Estimate | Std. Error | t value | Pr(>|t|) | |
|---|---|---|---|---|
| (Intercept) | 33526.962 | 321.157 | 104.394 | 0.000 |
| TE | -529.501 | 10.654 | -49.700 | 0.000 |
| solar_sarah | -15846.613 | 1141.001 | -13.888 | 0.000 |
| factor(WeekdayNum)1 | 5596.396 | 93.053 | 60.142 | 0.000 |
| factor(WeekdayNum)2 | 6324.780 | 93.121 | 67.920 | 0.000 |
| factor(WeekdayNum)3 | 6360.136 | 93.097 | 68.317 | 0.000 |
| factor(WeekdayNum)4 | 6279.360 | 93.250 | 67.339 | 0.000 |
| factor(WeekdayNum)5 | 5464.630 | 93.218 | 58.622 | 0.000 |
| factor(WeekdayNum)6 | 1169.326 | 93.079 | 12.563 | 0.000 |
| I(DSN^2) | 0.660 | 0.041 | 16.114 | 0.000 |
| factor(Month)2 | 4825.442 | 455.153 | 10.602 | 0.000 |
| factor(Month)3 | 5306.641 | 496.149 | 10.696 | 0.000 |
| factor(Month)11 | 3286.746 | 266.867 | 12.316 | 0.000 |
| factor(Month)12 | 9149.686 | 289.650 | 31.589 | 0.000 |
| factor(Year)1992 | -34.595 | 226.488 | -0.153 | 0.879 |
| factor(Year)1993 | -26.648 | 226.925 | -0.117 | 0.907 |
| factor(Year)1994 | 453.790 | 227.098 | 1.998 | 0.046 |
| factor(Year)1995 | 1502.790 | 226.743 | 6.628 | 0.000 |
| factor(Year)1996 | 2365.578 | 227.285 | 10.408 | 0.000 |
| factor(Year)1997 | 2804.758 | 227.065 | 12.352 | 0.000 |
| factor(Year)1998 | 3134.583 | 227.123 | 13.801 | 0.000 |
| factor(Year)1999 | 3779.208 | 226.878 | 16.657 | 0.000 |
| factor(Year)2000 | 4435.593 | 226.786 | 19.558 | 0.000 |
| factor(Year)2001 | 5206.297 | 226.702 | 22.965 | 0.000 |
| factor(Year)2002 | 5574.093 | 227.401 | 24.512 | 0.000 |
| factor(Year)2003 | 6347.567 | 226.968 | 27.967 | 0.000 |
| factor(Year)2004 | 6576.452 | 226.678 | 29.012 | 0.000 |
| factor(Year)2005 | 5704.218 | 226.901 | 25.140 | 0.000 |
| factor(Year)2006 | 5376.686 | 226.954 | 23.691 | 0.000 |
| factor(Year)2007 | 4761.861 | 227.234 | 20.956 | 0.000 |
| factor(Year)2008 | 4100.558 | 226.620 | 18.094 | 0.000 |
| factor(Year)2009 | 2600.179 | 226.727 | 11.468 | 0.000 |
| factor(Year)2010 | 2864.807 | 228.077 | 12.561 | 0.000 |
| factor(Year)2011 | 2258.601 | 227.157 | 9.943 | 0.000 |
| factor(Year)2012 | 1629.380 | 226.662 | 7.189 | 0.000 |
| factor(Year)2013 | 1119.108 | 227.056 | 4.929 | 0.000 |
| factor(Year)2014 | 488.905 | 227.196 | 2.152 | 0.031 |
| I(DSN^2):factor(Month)2 | -0.708 | 0.053 | -13.357 | 0.000 |
| I(DSN^2):factor(Month)3 | -0.728 | 0.047 | -15.418 | 0.000 |
| I(DSN^2):factor(Month)11 | 0.502 | 0.225 | 2.231 | 0.026 |
| I(DSN^2):factor(Month)12 | -3.808 | 0.079 | -48.028 | 0.000 |
T
To assess model generalizability, we analysed our model using expanding window cross-validation. However, our model includes year as a factor which would be updated by NESO each year based on their subjectively estimated year effect. To mimic this process we altered our cross validation method as follows:
Firstly we fit the model on an initial training set from 1991 up to 2000. Then any the year effect is removed from the trained model. Using this trained model, we compute the ‘yearless predictions’ for 2001. To mimic adding back in a year effect (as NESO would) we assumed a known year effect, using the coefficient estimated from the full dataset. This was added to the yearless predictions’ to compute our final predictions. We then compare these with the observed data for 2001 and obtain the following metrics:
Squared error: \(\text{SE} = (y - \hat{y}_F)^2\)
Interval score: \(\text{IS}(\alpha) = U_F - L_F + \frac{2}{\alpha} (L_F - y) \mathcal{1}\{y \leq L_F\} + \frac{2}{\alpha} (y - U_F) \mathcal{1}\{y > L_F\}\) where \(L_F\) and \(U_F\) denote the lower and upper bound of the prediction interval with coverage probability \(\alpha\).
Dawid-Sebastiani: \(\text{DS} = \frac{(y - \hat{y}_F)^2}{\sigma^2_F} + \log(\sigma^2_F)\)
We then repeat this process, expanding the training set by one year and moving the test set forward one year.
There are drawbacks to this method: for example by assuming the known year effect from the full model, we leak information about the test year into the predictions. However we still choose this method for two reasons: Firstly, it still minimises the disrupt to the chronilogical strucutre of the data. Secondly, expanding the training set at each step, incorporates more historical data helping the model capture long-term, recurring seasonal and weekly trends more effectively.
The observed vs predicted values from our cross validation are plotted in Figure 2. The points lie scattered around the \(y=x\) line indicating accuracy in predictions. It is important to note that there are some points that stray above the \(y=x\) line. However these occur at lower demand levels. Since NESO are particularly interested in accuracy at higher demand levels where shortfalls are likely to occur, this indicates that the model still remains well-suited for accurately estimating average daily demand, especially at high levels.
Using cross validation we find the following predictive scores.
| Model | RMSE | Mean Dawid Sebastiani Score | Mean Interval Score |
|---|---|---|---|
| Final | 1462.544 | 15.58537 | 9085.693 |
To interpret these results its useful to compare it to a baseline model \(M_1\) and a more complex model \(M_2\):
| Model | RMSE | Mean Dawid Sebastiani Score | Mean Interval Score |
|---|---|---|---|
| \(M_1\) | 1872.626 | 16.09661 | 12492.79 |
| \(M_2\) | 1872.626 | 16.09661 | 12492.79 |
Observe that each score for the final model is lower compared to \(M_1\). This suggests higher accuracy in predictions as well as confidence associated with these predictions. On the other hand observe that the scores for the model \(M_2\) with several interactions are marginally smaller than our final model. This indicates that simplifying the model the interest of practicality is justified.
To assess whether our final model is overfitting, we compare the RMSE based on the full set of data with the RMSE derived from the Cross Validation:
| Fitted RMSE | Cross Validation RMSE |
|---|---|
| 1471.741 | 1462.544 |
The fitted RMSE is slightly lower than the cross-validation RMSE, likely due to smaller estimation and prediction sets in cross-validation. However, the difference is small relative to the data scale and not significant, suggesting the model does not perform significantly better on the data it was trained on compared to new data. <<<<<<< HEAD Therefore we conclude the chosen model is not overfitting.
We next use cross-validation to compare the basic model with our chosen model. For each model, we plot predicted demand against observed demand, with color indicating the fold the data belongs to. Ideally, points should align with the black line \(y=x\), indicating accurate predictions.
For the basic model (left), the plot shows a wide dispersion around \(y=x\), forming a broad parallelogram shape. We see significant overestimation at lower observed demand levels and underestimation at higher levels. This indicates the model lacks complexity to capture key patterns.
In contrast, the chosen model (right) aligns more closely with \(y=x\). While we still see some overestimation and underestimation, the predictions are generally more accurate. However, there is a scatter of points above the line which are significant overestimations. This is a limitation of the model. However in general these are scattered and more occasional. On the other hand the model avoids large underestimation and generally maintains consistency around the \(y = x\) line. This highlights that the chosen model better captures demand patterns and generalizes more effectively.
The predictive scores, printed in the table below, further indicate that the chosen model forecasts peak demand levels more accurately.
Firstly, the RMSE for the basic model is \(3702.12\) while for the final model is \(2462.88\). These both seem quite large but we can consider this in terms of the scale of the data. For example, considering it as a percentage relative to the mean peak electricity demand, \(49294.32\), we find the RMSE for the basic model is \(\frac{3702.12}{49294.32} * 100 = 7.51\%\) and the RMSE for the final model is \(\frac{2462.88}{49294.32} * 100 = 4.99\%\). Furthermore, the Dawid Sebastiani score and the interval score are both lower for the final model compared to the basic model. These results indicates that the final model has a stronger performance than the basic model.
To measure the prediction accuracy of the final model across all of the months, the following 3 prediction scores:
These were computed using the 24-fold cross validation scheme as defined earlier. The average of each of these scores, grouped by month, are displayed in the graphs below.
It is clear from all three scores that the prediction accuracy of the model is not consistent across all months. The model makes the least accurate predictions for the month of December, as highlighted by the fact that the prediction score for this month is the highest across all three scores. This may be due to more variation in energy usage in December, due to disruption to standard working hours in the holiday/festive season. Energy usage may increase for example due to Christmas lights.
The model makes the most accurate predictions for the month of February, as highlighted by the fact that the prediction score for this month is the lowest across all three scores.
We next investigate how well the final model predicts for weekdays vs weekends. To do this we first plot the predicted values against observed demand for the weekend and weekdays individually.
Observe that for the weekend (right) the points mostly align closely with \(y=x\) with a slight underestimation as demand increases. On the other hand for weekdays (left), while most points align closely with \(y=x\) for the high demand levels, as demand decreases the values are more scattered. In particular, there is slight underestimation of demand and some significant overestimations of demand. This trend makes sense, as we saw during data exploration that weekdays tend to have higher demand levels. As a result, low demand levels are relatively uncommon, making them harder to predict accurately. Ideally, however, the model would be able to account for weekdays where the demand was uncommonly low or weekend where demand was uncommonly high. This limitation of the model is discussed further in the limitations section.
Despite the strengths of our final model, there are limitations that must be taken into account when interpreting our results.
Firstly, it is clear that our model overestimates the peak electricity demand in the lower tails, when the peak demand is relatively low. However, NESO has expressed that overestimating is less of a priority for them than failing to meet high peak demand, making this less of a concern.
As mentioned above, if the actual demand is low on a weekday, the model is more likely to predict demand inaccurately. While this makes sense, since in general the demand is higher on weekdays, ideally the model would be able to account for this. A potential way of adapting the model to account for this could be considering day-type as a factor rather than weekday index.
As previously mentioned, the residuals of our model are not normally distributed. However this is the least important assumption as we can appeal to the Central Limit Theorem as the size of the sample grows.
Many of the variables are highly correlated with each other (e.g. TO vs TE). This is a drawback of the data we have been given.